Stratum-specific health outcome estimation in Pakistan using double goal CART

Post-stratification is applied when subpopulation membership is observed only for sampled units and the goal is to estimate stratum-specific parameters. This leads survey statisticians towards two primary goals: classification of non-sampled units into different strata and prediction of the values of the study variable. Regression models, on one side, optimize the prediction of the study variable’s non-sampled values, while classification algorithms, on the other side, assign non-sampled cases to different strata. Hence, it is crucial to deal with these two goals simultaneously when estimating stratum-specific parameters. This study introduces a double-objective classification and regression trees (CART) approach for estimating stratum-specific parameters. Theoretical properties of the total estimator are derived. An application to the estimation of health outcomes in different domains is given to illustrate the practical significance as well as the efficiency of the proposed CART-based method. The proposed estimator of the population total performs better than the existing stratum-specific estimator in terms of relative efficiency for all choices of parameters. As an ensemble model, the random forest outperforms the other competing tree-based models and the homogeneous population model, which does not use any auxiliary variable.


Introduction
Survey statisticians have made major advances in the science of probability sampling, but most practitioners oppose the use of uncontrolled sampling due to the large variation in sampling units. Stratified random sampling controls this variation with regard to the key study characteristics while maintaining the sample's probabilistic nature. Numerous studies on the modification and enhancement of stratification methods have been published following Neyman's (1938) groundbreaking work [1]. Stratified sampling is justified only when the stratification variable is known prior to sample selection. However, because such variables fluctuate over time and the sampling frame is built using census data from a few years earlier, it is difficult to locate updated information about stratum indicators like house size, socioeconomic status, and education in the majority of health-related surveys.
Post-stratification, on the other hand, refers to observing the values of the study variable and the stratum membership variable after the sample has been chosen. For instance, a demographic survey typically cannot stratify by age, because the age of individuals is not available until the sample is collected. Post-stratification uses auxiliary data in finite population parameter estimation to improve the precision and accuracy of the estimates. [2] introduced a method of post-stratification that leads to a substantial reduction in the amount of sampling needed to get reasonable population estimates. Using a sampling strategy known as multiple inverse sampling, [3] attempted to overcome the challenge of post-stratification in large samples, as post-stratification is as effective as ordinary stratification with proportional allocation. Moreover, [4] focused on the effects of the post-stratification procedure used in the labor force survey (LFS) and investigated whether registered auxiliary information used in post-stratification yields a more precise estimator of parameters. Predictive modeling has also been used to examine post-stratification, with population values assumed to be random variables produced by a model and population quantities inferred using the model's predictive capabilities [5].
Aside from design-based estimators, model-based approaches rely on the model relationship between the response variable and the predictors to enhance the precision of estimates. Initially, [6] used a straightforward regression model with auxiliary variables to predict the totals of the non-sampled units, which are unknown random quantities. [7,8] predicted the non-sampled values of the study variable using a smooth function when estimating the finite population total. Moreover, [9] introduced an estimator based on a penalized spline regression function to obtain the model-assisted estimator of the population total using classical local polynomial regression (CLPR). After that, [10] employed a model-based approach to estimate the unknown parameter of the study variable using local linear regression (LLR). Similarly, [11] analyzed data from complex surveys using nonparametric estimation methods. Later, [12] proposed a novel method for estimating a finite population parameter, which considers a linear combination of population values in a superpopulation setting with a known basis function regression (BFR) model. [13] discussed applying linear, mixed, nonparametric, and machine learning techniques to estimate finite population parameters using complex survey data and auxiliary information. Under commonly used feature selection criteria in machine learning, the suggested estimator's prediction error variance was computed. To apply machine learning predictions to unobserved data, [14] suggested an active sampling technique for data subsampling. By overcoming design constraints, this method enhances performance in virtual simulation-based safety assessment of advanced driver assistance systems. Moreover, [15] suggested a method that combines data from several sources with accepted practices to obtain precise and reliable estimates. They added that Big Data, which gives more rapid and detailed statistics, offers a solution to diminishing response rates and survey expenses. To make the transition from intended data to data-oriented statistics, it is necessary to understand the prerequisites for reliable inference. This goal is concretized through a number of statistical frameworks; however, these are broad approaches.
In machine learning, tree-based methods are favored for their ease of application and their ability to capture linear and/or non-linear relationships between variables without assuming a specific functional form. The Classification and Regression Tree (CART) algorithm predicts target variable values based on covariates and provides easily interpretable results. CARTs perform classification and prediction based on observed data and can be employed to predict the values of unobserved data. [16] examined the prediction performance of decision trees like CART and compared the results with those of other tree-based methods. Further, [17] studied the prediction performance of decision trees like CART and also compared various tree-based algorithms. The performance of three non-parametric tree-based approaches was later examined by [18] for general forest mapping with high-resolution SPOT-HRG data, because traditional methods like field surveys are time- and money-consuming. After that, [19] examined software defect prediction as a research area in software engineering and the prediction capabilities of seven tree-based ensembles. Similarly, [20] stated that the decision tree algorithm is the most important and efficient machine learning method.
[21] investigated automatic diabetes prediction using random forest and gradient boosting classifiers. With proper data processing, hyper-parameter tuning, and oversampling, these tree-based ensemble methods can achieve above 90% accuracy. Recently, [22] developed a model-assisted technique based on random forests to estimate the functional relationship between the survey variable and the auxiliary variables. They also established the theoretical properties of the procedure and calculated the associated estimator. Additionally, a model calibration process for dealing with numerous survey variables was covered.
When separate estimates are needed in different study domains, model-based approaches may be used for two purposes. First, a model is used to classify units into the study domains; second, a specific model is applied to predict the non-sampled values.
The main contribution of the paper is to use two tree-based machine learning algorithms to obtain separate estimates in different sub-populations, called domains, where domain membership is observable only in the sample. The proposed method for estimating finite population parameters (the population total) uses a classification tree-based algorithm to classify non-sampled units (the units not selected in the sample) into different strata (domains) and a regression tree-based algorithm to predict the values of the study variable for non-sampled units. To evaluate the performance of the suggested estimator and to assess the applicability of the method, we use bootstrap studies for two situations, taking different health-related variables as the variables of interest.
In Section 2, we provide an overview of the classical model-based stratum-specific estimator of the population total with its finite sample properties. Section 3 presents the proposed tree-based algorithm for estimating the stratum-specific total. Section 4 comprises bootstrap studies for two different cases to evaluate the performance of the stratum-specific total estimator. Section 5 concludes the study with some future recommendations.

Existing model-based estimation method
Let $U = \{1, 2, 3, \ldots, N\}$ be the set of serial numbers attached to the units in a finite population of size $N$. Further, let $Y$ and $X$ be the study and auxiliary variables with values $y_i$ and $x_i$ corresponding to the $i$th population unit for all $i \in U$. The population consists of $H$ mutually exclusive and exhaustive strata whose membership is assumed to be unknown prior to the survey. The stratum membership variable for the $h$th stratum is denoted $A_{hi}$, which takes the value $a_{hi} = 1$ if the $i$th unit belongs to the $h$th stratum and $a_{hi} = 0$ otherwise, for $i = 1, 2, 3, \ldots, N$ and $h = 1, 2, \ldots, H$. The stratum membership variables $A_{hi}$, $h = 1, 2, \ldots, H$, are taken to be independently distributed Bernoulli random variables with $P(A_{hi} = 1) = \lambda_h$. With $\mu_h$ and $\sigma_h^2$ denoting the mean and variance of $Y_i$ within the $h$th stratum, the mean and variance of the product $A_{hi} Y_i$ are
$$E(A_{hi} Y_i) = \lambda_h \mu_h$$
and
$$\mathrm{Var}(A_{hi} Y_i) = \lambda_h \sigma_h^2 + \lambda_h (1 - \lambda_h) \mu_h^2.$$
The covariance between $A_{hi} Y_i$ and $A_{hj} Y_j$ for $i \neq j$, $\forall i, j \in U$, is zero, as $Y_i$ (conditionally) and $A_{hi}$ are independent random variables. Following Chambers and Clark (2012) [23], the expansion estimator for the $h$th stratum total $t_{yh}$ is given by
$$\hat{t}^{E}_{yh} = N \hat{\lambda}_h \bar{y}_{sh},$$
where $\bar{y}_{sh} = n_h^{-1} \sum_{i \in s} a_{hi} y_i$ is the sample mean for the $h$th stratum and $\hat{\lambda}_h = n_h / n$ is the estimator of $\lambda_h$. Equivalently, $\hat{t}^{E}_{yh} = \frac{N}{n} \sum_{i \in s} a_{hi} y_i$. Taking the model expectation of the prediction error gives
$$E(\hat{t}^{E}_{yh} - t_{yh}) = \frac{N}{n}\, n \lambda_h \mu_h - N \lambda_h \mu_h = 0,$$
so the unbiasedness condition is satisfied. The model variance of the prediction error can be obtained as
$$\mathrm{Var}(\hat{t}^{E}_{yh} - t_{yh}) = \frac{N^2}{n} \left(1 - \frac{n}{N}\right) \left[\lambda_h \sigma_h^2 + \lambda_h (1 - \lambda_h) \mu_h^2\right].$$
The expansion estimator $\hat{t}^{E}_{yh}$ has two attractive features: the BLUP property with respect to the model, and compensation for the unknown stratum size. However, it does not use the known auxiliary information. Classical model-based estimation approaches with auxiliary variables have filled this gap (see [13]). When the model relationship between the study variable and the auxiliary variables is non-linear, we can no longer rely on the classical model-based estimators. The tree-based estimation procedure fills this gap and improves efficiency in the estimation of finite population parameters.
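As a small illustration of the expansion estimator discussed in this section, the following Python sketch computes the stratum total in its two equivalent forms, $(N/n)\sum_{i \in s} a_{hi} y_i$ and $N \hat{\lambda}_h \bar{y}_{sh}$. The function names and the toy sample are hypothetical and are not taken from the paper; the sketch only assumes that each sampled unit is recorded as a (stratum, y) pair.

```python
def expansion_total(sample, N, h):
    """Expansion estimator (N / n) * sum over the sample of a_hi * y_i.

    sample: list of (stratum, y) pairs for the n sampled units (hypothetical layout).
    N: population size; h: label of the stratum of interest.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    # a_hi is 1 when the unit belongs to stratum h and 0 otherwise,
    # so only units of stratum h contribute to the sum.
    return N / n * sum(y for stratum, y in sample if stratum == h)


def expansion_total_product_form(sample, N, h):
    """Same estimator written as N * lambda_hat_h * ybar_sh."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    ys = [y for stratum, y in sample if stratum == h]
    if not ys:
        return 0.0  # no sampled unit in stratum h, so lambda_hat_h = 0
    lambda_hat = len(ys) / n      # estimated stratum proportion n_h / n
    ybar_sh = sum(ys) / len(ys)   # sample mean within stratum h
    return N * lambda_hat * ybar_sh


if __name__ == "__main__":
    toy_sample = [(1, 2.0), (2, 1.0), (1, 4.0), (2, 3.0)]  # made-up values
    print(expansion_total(toy_sample, N=20, h=1))
    print(expansion_total_product_form(toy_sample, N=20, h=1))
```

Both functions return the same value on any sample, which reflects the identity $N \hat{\lambda}_h \bar{y}_{sh} = (N/n) \sum_{i \in s} a_{hi} y_i$; neither uses auxiliary variables, which is the limitation the tree-based approach addresses.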

Proposed model-based estimation method
Decision trees are non-parametric methods that split the data into smaller, more "pure" or homogeneous groups known as nodes. A simple way to define "purity" is in terms of increasing accuracy or decreasing misclassification error. Decision tree models are suitable when there is good reason to suspect non-additive interaction among variables or when there are far too many variables under study. In general, a decision tree makes a decision depending on whether a statement is true or false. CART is better at detecting such relationships than interaction terms in linear models. Tree-based methods are favored because they are easy to apply, capture linear and/or non-linear relationships between the variables without assuming a specific functional form, and do not assume that all study variables are equal. To categorize the non-sampled units into strata and to predict the values of the study variable for the non-sampled part, two tree-based methods are used simultaneously; we call this the double-goal classification and regression tree (DGCART) approach. Here, we modify the CART method to fulfil the dual objectives of stratification and prediction. The DGCART-based estimation algorithm is summarized in Table 1 and Fig 1.

Table 1. Illustration of DGCART for estimation of finite population total.
Step Description
1. Select a simple random sample of size $n$ from the population $U$, leaving $N - n$ units as non-sampled.
2. Observe the study variable $Y$ and the domain membership variable $a_h$ ($h = 1, 2, 3, \ldots, H$).
3. Grow a classification tree from the sampled data. Utilize an Attribute Selection Measure (ASM) to identify the dataset's top attributes. The Gini index is used as the measure of impurity to construct the decision tree. Finding the attribute that yields the most information gain is the key to building a decision tree.
4. Classify the non-sampled units according to the tree grown in step 3.
5. Grow a regression tree from the sampled data $[Y: x_1, x_2, x_3, \ldots, x_p]$ for prediction of the study variable.
6. Predict the values of the non-sampled units according to the tree grown in step 5.

Fig 1. An illustration for tree-based estimation of parameters in domain.

The nodes of the classification tree are then classified into different strata according to the majority vote, i.e., the $l$th node is classified to the $h$th stratum if $n_{hl} = \max\{n_{1l}, n_{2l}, \ldots, n_{Hl}\}$. The stopping rule for the classification tree is made by observing the increase in variance of $A_h$ ($h = 1, 2, \ldots, H$). We set a reduction-in-variation function $\Delta\{V(A_h \mid C_F, C_{F-1})\}$, where $C_F$ is the set of classes at the final node of the classification tree and $C_{F-1}$ is the set of classes at the node preceding the final node of the classification tree. At the final node, $\lambda_h$ is estimated using the node-specific data, where $n_F = \sum_{C_F} n_l$ is the total number of units at the final node. The process continues as long as $\Delta\{V(A_h \mid C_F, C_{F-1})\}$ does not fall below a pre-specified value $\Delta_0$. At this stage one should ensure that the sample size at node $t$ for a given $h$ is at least 2, i.e., $n_{th} \geq 2$. Once the stopping criterion is met, we get the classified data as $C = \{C_1, C_2, \ldots, C_l, \ldots, C_L\}$. After classification of the non-sampled units into different domains, we grow a regression tree from the sampled data, i.e., $[y: x_1, x_2, x_3, \ldots, x_p]$, for prediction of the values of the study variable.

Moreover, let $C^* = \{C^*_1, C^*_2, \ldots, C^*_t, \ldots, C^*_T\}$ be the set of nodes of the regression tree constructed on the sampled units. The values of the non-sampled units are predicted as the mean of the sampled values in a given node; for example, at the $t$th node the value of $y_i$ is predicted as $\hat{y}_i = \bar{y}_t = n_t^{-1} \sum_{j \in s \cap C^*_t} y_j$, where $n_t$ is the number of sampled units in that node. The stopping rule for the regression tree is made by observing the increase in variance of the study variable $y$ at a given node. We set an analogous reduction-in-variation function, where $C^*_F$ is the set of classes at the final node of the regression tree and $C^*_{F-1}$ is the set of classes at the node preceding the final node of the regression tree. At the final node, the mean and the variance of $y$ are estimated using the node-specific data for the $h$th domain.

The predictive estimation problem starts with partitioning the total of the $h$th stratum into sampled and non-sampled parts. The $i$th value of the study variable in the non-sampled part is then predicted using the mean value of its class. The resulting tree-based estimator for the domain total, $\hat{T}_{yh}$, is given in Eq (13) and involves the term $\sum_{l=1}^{L} \hat{a}_{hi} n_l$. An estimate of $r_h$ is then obtained and inserted in Eq (13). When the classification does not divide the data in a meaningful way, the combined stratum mean $\bar{y}_{hC^*_t}$ coincides with the overall stratum-specific mean $\bar{y}_{sh}$, i.e., $\bar{y}_{hC^*_t} = \bar{y}_{sh}$, and, as a result, the tree-based total estimator gives a similar result to the expansion estimator, i.e., $\hat{T}_{yh} = \hat{\lambda}_h N \bar{y}_{sh}$. The prediction error of the tree-based total estimator is written in Eq (16). Applying the model expectation to Eq (16), given the set of classes $C_l$, yields Eq (17). Inserting $E(\hat{a}_{hi} \bar{y}_t \mid C_l, a^*_{hi} = 1)$ in Eq (17) gives the conditional bias. The model bias term reduces as the sum of class-specific means approaches the overall population mean, which is the worst situation in terms of efficiency. To increase efficiency, we need to accept some amount of bias in the prediction process. The corresponding variance of $\hat{T}_{tbh}$ follows in a similar way.
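To make the two goals concrete, the sketch below assembles a domain total once both trees have been grown. It is a minimal illustration under stated assumptions, not the authors' implementation: each unit is assumed to arrive already labelled with its classification-tree node and its regression-tree node, the domain total is assumed to take the predictive form "observed sampled part plus predicted non-sampled part", and all names (`dgcart_domain_total`, the tuple layout, the toy data) are hypothetical.

```python
from collections import Counter, defaultdict


def dgcart_domain_total(sampled, non_sampled, h):
    """Sketch of a double-goal (classify + predict) domain total.

    sampled:     list of (cls_node, reg_node, stratum, y) for sampled units.
    non_sampled: list of (cls_node, reg_node) for non-sampled units.
    Every node that holds a non-sampled unit is assumed to hold at least one
    sampled unit, as is the case for leaves of a tree grown on the sample.
    """
    # Goal 1: label each classification node by majority vote, i.e. node l
    # goes to the stratum h with the largest sampled count n_hl.
    votes = defaultdict(Counter)
    for cls_node, _, stratum, _ in sampled:
        votes[cls_node][stratum] += 1
    # Ties are broken in favour of the stratum seen first in the data.
    node_stratum = {node: c.most_common(1)[0][0] for node, c in votes.items()}

    # Goal 2: predict y in each regression node by the mean of sampled values.
    sums, counts = defaultdict(float), defaultdict(int)
    for _, reg_node, _, y in sampled:
        sums[reg_node] += y
        counts[reg_node] += 1
    node_mean = {node: sums[node] / counts[node] for node in sums}

    # Assumed predictive form: observed part plus predicted non-sampled part.
    observed = sum(y for _, _, stratum, y in sampled if stratum == h)
    predicted = sum(node_mean[reg_node]
                    for cls_node, reg_node in non_sampled
                    if node_stratum[cls_node] == h)
    return observed + predicted


if __name__ == "__main__":
    toy_sampled = [("A", "T1", 1, 2.0), ("A", "T1", 1, 4.0), ("A", "T2", 2, 1.0),
                   ("B", "T2", 2, 3.0), ("B", "T2", 2, 8.0)]  # made-up values
    toy_non_sampled = [("A", "T1"), ("A", "T2"), ("B", "T2")]
    print(dgcart_domain_total(toy_sampled, toy_non_sampled, h=1))
    print(dgcart_domain_total(toy_sampled, toy_non_sampled, h=2))
```

Keeping the classification nodes and the regression nodes as separate labels mirrors the use of two trees: a non-sampled unit's stratum comes from one tree and its predicted value from the other.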

Bootstrap studies
We conduct a bootstrap study using the Pakistan Maternal Mortality Survey dataset, 2019 [24], for two cases: (1) taking pregnancy losses as the study variable, and (2) taking delivery duration as the study variable. After omitting rows with missing responses, the dataset consists of N = 634 observations with 28 variables (see details of the variables in Appendix A). Considering this dataset as the population, simple random samples of size n = 20, 30, 40, 50, 65, and 75 are drawn. Two separate trees are grown, one for the prediction problem and the other for classification of the non-sampled units, using 5 different CART models and a random forest model. The models used in this study are described in Table 2.
We use different decision tree tuning parameters (hyper-parameters) to tune the trees. There are 5 different CART models with different values of the hyper-parameters and one random forest model. The tree parameters include the maximum depth, which is intended to prevent overfitting; "min split", the minimal number of observations needed in a node for a split to be attempted; and "min bucket", the minimum number of observations permitted in a terminal node. The value "None" indicates that no value was set for the relevant hyper-parameter in the model.
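For each model, the expected absolute prediction error (EAPE) of the stratum-specific total estimator is obtained as $\mathrm{EAPE}(\hat{T}_{yh}) = \frac{1}{Q} \sum_{q=1}^{Q} |\hat{T}^{(q)}_{yh} - t_{yh}|$, where h = 1, 2, $\hat{T}^{(q)}_{yh}$ is the estimate from the $q$th simulation, and Q denotes the number of simulations. Further, the mean square prediction error (MSPE) of the stratum-specific total estimator is obtained under the different models as $\mathrm{MSPE}(\hat{T}_{yh}) = \frac{1}{Q} \sum_{q=1}^{Q} (\hat{T}^{(q)}_{yh} - t_{yh})^2$, where h = 1, 2 and Q denotes the number of simulations.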
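The evaluation measures can be sketched in a few lines of Python. The version below is an assumption-laden illustration: it takes EAPE as the average absolute deviation and MSPE as the average squared deviation of the Q simulated estimates from the true stratum total, and relative efficiency (RE) as the MSPE of the existing estimator divided by that of the proposed one, so that values above 1 favour the proposed method. Whether the paper additionally scales these measures (for example by the true total) cannot be confirmed from the text, and the function names and toy numbers are hypothetical.

```python
def eape(estimates, true_total):
    """Average absolute deviation of the simulated estimates from the truth."""
    if not estimates:
        raise ValueError("need at least one simulated estimate")
    return sum(abs(t - true_total) for t in estimates) / len(estimates)


def mspe(estimates, true_total):
    """Average squared deviation of the simulated estimates from the truth."""
    if not estimates:
        raise ValueError("need at least one simulated estimate")
    return sum((t - true_total) ** 2 for t in estimates) / len(estimates)


def relative_efficiency(existing, proposed, true_total):
    """MSPE(existing) / MSPE(proposed); above 1 favours the proposed method.

    Assumes the proposed estimator's MSPE is non-zero.
    """
    return mspe(existing, true_total) / mspe(proposed, true_total)


if __name__ == "__main__":
    true_total = 100.0                         # made-up stratum total
    existing = [90.0, 110.0, 100.0, 120.0]     # made-up estimates, Q = 4
    proposed = [95.0, 105.0, 100.0, 110.0]
    print(eape(proposed, true_total), mspe(proposed, true_total))
    print(relative_efficiency(existing, proposed, true_total))
```

In a bootstrap study of this kind, each list would hold one estimate per simulated sample for a fixed stratum h, sample size n and model.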
In R, simple tree-based algorithms with some choices of tree size, splitting criteria, the number of trees to be produced, etc. are fitted using the rpart package. Like other partitioning algorithms, rpart employs a metric to choose the optimal rule for dividing the data. The method uses the Gini index as the computational metric.
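The role of the Gini index in choosing a split can be illustrated with a small, self-contained Python sketch. This is not the rpart implementation; it only shows the underlying idea for a single numeric covariate, using the standard definition of Gini impurity, $1 - \sum_k p_k^2$. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_threshold(x, labels):
    """Return (threshold, impurity) of the best split of the form x <= threshold.

    Returns None when x takes a single value, since no split is possible.
    """
    best = None
    # The largest value is skipped because it would leave the right child empty.
    for threshold in sorted(set(x))[:-1]:
        left = [lab for xi, lab in zip(x, labels) if xi <= threshold]
        right = [lab for xi, lab in zip(x, labels) if xi > threshold]
        score = split_impurity(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


if __name__ == "__main__":
    x = [1, 2, 3, 4]           # made-up covariate values
    strata = [1, 1, 2, 2]      # made-up stratum labels
    print(gini_impurity(strata))
    print(best_threshold(x, strata))
```

A full tree repeats this search over all covariates at every node and stops according to controls such as the maximum depth, "min split" and "min bucket" described above.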

Case 1
Compared with other South Asian nations, Pakistan has the highest rate of pregnancy losses (30.6 pregnancy losses per 1000 total births) [25]. There is a paucity of literature on the lived experiences of Pakistani women who have experienced multiple stillbirths, despite the well-documented psychological effects of stillbirths on bereaved women [26]. Multiple stillbirths have a severe effect on women's emotional and social welfare. In Case 1, therefore, the usage of contraceptive methods is taken as the stratum membership variable, h = 1, 2 (Stratum 1 and Stratum 2), and the number of pregnancy losses (which ranges from 1 to 20) is taken as the study variable.
Table 3 shows the bootstrap study results for Case 1. There are five different CART models according to the tree parameters and one random forest (rf) model. The table provides $\hat{\lambda}_h$, the estimated proportion of the $h$th stratum, and $\hat{\mu}_{yh}$, the mean of the respective stratum, together with the expected absolute prediction error (EAPE), mean square prediction error (MSPE) and relative efficiency (RE) of the estimators for different choices of sample size. Table 3 shows that the mean number of miscarriages is higher for Stratum 2, i.e., mothers who have ever used contraceptive methods, than for women who have not used any birth control. The mean number of pregnancy losses is in the range [1.5299, 1.5745] for mothers who have ever used a contraceptive method and [1.3895, 1.4301] for those who have not. The average absolute deviation of the predictions from the true values of the parameter is measured by the EAPE for the different models. No significant change is observed in EAPE values with a change in tree parameters; however, EAPE values have a slightly increasing trend with an increase in sample size, which shows a trend of unbiasedness for larger sample sizes. Further, EAPE values are higher in Stratum 1 than in Stratum 2.
Similar to EAPE, there is no significant change in MSPE values with a change in tree parameters. However, the relative MSPE values corresponding to the random forest are significantly smaller than those of all single-tree models. The relative efficiency (RE) values are greater than one for all combinations of tree parameters, showing the superiority of the tree-based total estimators over the corresponding estimators under the homogeneous model. Moreover, the RE of the random forest model is the highest among all competing models, owing to the ensemble technique that random forest (rf) algorithms apply when building classification and regression models for observed data: random forests build cumulative decision trees. The proposed tree-based strategy is particularly successful when applied in an ensemble model, as shown by the fact that the relative efficiency is even higher in the random forest model. The comparison of the competing models is displayed in Figs 2 and 3 for the Case 1 bootstrap study.

Case 2
The duration of delivery varies from one birth to another; sometimes it is over in a matter of hours. Delivery duration is the time taken by the procedure of giving birth to a child [27]. As delivery duration is also an important variable to study, in Case 2 the usage of iron tablets during pregnancy is taken as the stratum membership variable and the duration of delivery is taken as the study variable.
Table 4 shows the bootstrap study results for Case 2, i.e., with "delivery duration" as the study variable and "usage of iron tablets" as the stratum membership variable. There are five different CART models according to the tree parameters and one random forest (rf) model. $\hat{\lambda}_h$ shows the stratum proportion for the strata h = 1 and h = 2, and $\hat{\mu}_{yh}$ represents the mean of the respective stratum.
Table 4 shows that the mean estimated time for delivery is almost equal in the two strata, i.e., Stratum 1 and Stratum 2. The relative efficiency of the random forest model is greater than that of all single-tree models in all results, because a random forest aggregates many trees (500 here) rather than relying on a single tree.
We assessed 5 classification and regression tree models and 1 random forest model in our study and determined the EAPE for each model. No significant change is observed in EAPE values with a change in tree parameters; however, EAPE values have a slightly increasing trend with an increase in sample size. Further, EAPE values are higher in the smaller stratum (h = 1) than in the larger one (h = 2).
Table 4 also provides the mean square prediction error for the 5 single classification and regression tree models and 1 random forest model. Similar to EAPE, there is no significant change in MSPE values with a change in tree parameters. However, the relative MSPE values corresponding to the random forest are significantly smaller than those of all single-tree models. The relative efficiency (RE) values are greater than one for all combinations of tree parameters, showing the superiority of the tree-based total estimators over the corresponding estimators that do not use any tree. The results given in Table 4 are visualized in Figs 4 and 5. Relative efficiency assesses the gain in efficiency of the proposed method over the existing method; a value larger than 1 shows that the proposed technique is more efficient. As evidenced by its relative efficiency being greater than 1 in all single-tree models and higher still in the random forest model, our proposed tree-based method is more effective than the existing method. From both the figures and the tables we infer that the simultaneous application of classification and regression trees for stratification of non-sampled units improves efficiency when appropriate hyper-parameters for the trees are set for the training task. Ensembling different trees for these two tasks provides the best performance of the total estimator for the variable of interest in different domains.

Conclusion
This study focused on classifying the non-sampled units into different strata using a classification tree algorithm and predicting the value of the study variable for the unobserved part of the population using a regression tree algorithm. Owing to their ease of interpretation and visualization, tree-based algorithms are considered good alternatives to classical regression and classification models. Tree-based algorithms also handle prediction and classification problems when the parametric relationship between the study variable and the predictors is ambiguous. Because of these attractive features, the DGCART method is proposed for estimating stratum-specific parameters. With random forests, one can make predictions from different random samples of covariates rather than selecting only the best ones, and thereby enhance the precision of the proposed estimators. Bagging in random forests also provides a direct estimate of prediction variance that can be considered in future studies. Similar studies in which stratum-specific estimates are needed can benefit from the current study's demonstration of how various input factors can be used to forecast a target value and then used at the estimation stage. The DGCART algorithm is especially useful for obtaining estimates of different indicators in specific demographic, socio-economic and geographic subpopulations in health-related surveys where the indicator of interest has a high proportion of missing observations, since the missing part of the actual sample can be treated as the non-sampled part.

Fig 2
Fig 2 graphically compares the relative efficiency of five single-tree models and one random forest model. With n = 50, all single CART models exhibit comparable relative efficiency. With n = 65, the single CART models differ in relative efficiency, with models 2 and 3 performing better. For n = 75 the relative efficiency changes across models, with model 4 having lower efficiency. The relative efficiency of the random forest is higher than that of the individual CART models for all sample sizes.

Fig 3
Fig 3 compares the relative efficiency of CART models with various sample sizes for the number of miscarriages per woman who did not use any kind of contraception before or throughout her pregnancy. Comparing the relative efficiency of single CART models with n = 75 to those with n = 50 and n = 65, we may conclude that larger sample sizes can yield more accurate population estimates. The relative efficiency of all single CART models is about the same for n = 50, and likewise for n = 65. The random forest model has higher relative efficiency than the single CART models because random forests build cumulative decision trees.

Fig 4
Fig 4 provides a graphical representation of the relative efficiency of the mean estimator when delivery duration is used as the study variable. For n = 20, 30 and 50, we have 5 single CART models and 1 random forest model. The relative efficiency of every model is more than 1.20, with that of the random forest model exceeding 1.65. In comparison to larger sample sizes, the relative efficiency of all single CART and random forest models is relatively low for n = 20.

Fig 5
Fig 5 illustrates the relative efficiency of the mean estimator when delivery time is used as the variable of interest. The random forest model is more efficient than all single CART models, with a relative efficiency of more than 1.40. The relative efficiency varies with the hyper-parameter values in each single CART model.